Document layout analysis : English Wikipedia
Document layout analysis

In computer vision, document layout analysis is the process of identifying and categorizing the regions of interest in the scanned image of a text document. A reading system requires text zones to be segmented from non-textual ones and arranged in their correct reading order. Detecting and labeling the different zones (or blocks) embedded in a document as text body, illustrations, math symbols, and tables is called geometric layout analysis. Text zones also play different logical roles inside the document (titles, captions, footnotes, etc.), and this kind of semantic labeling is the scope of logical layout analysis.
Document layout analysis is the union of geometric and logical labeling. It is typically performed before a document image is sent to an OCR engine, but it can also be used to detect duplicate copies of the same document in large archives, or to index documents by their structure or pictorial content.
Document layout is formally defined in the international standard ISO 8613-1:1989.
== Overview of Methods ==
There are two main approaches to document layout analysis. First, there are bottom-up approaches, which iteratively parse a document based on the raw pixel data. These approaches typically first parse a document into connected regions of black and white; these regions are then grouped into words, then into text lines, and finally into text blocks. Second, there are top-down approaches, which attempt to iteratively cut up a document into columns and blocks based on white space and geometric information.
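The bottom-up pipeline described above can be illustrated with a minimal sketch. It assumes a binary image stored as a list of rows in which 1 marks an ink pixel, uses 4-connectivity, and groups components into lines with a simple greedy test on vertical overlap. The function names `connected_components` and `group_into_lines` are hypothetical, and practical systems use more careful grouping rules.

```python
from collections import deque


def connected_components(img):
    """Return bounding boxes (top, left, bottom, right), inclusive, of the
    4-connected regions of ink pixels (value 1) in a binary image."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if img[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first flood fill starting from an unvisited ink pixel.
            queue = deque([(y, x)])
            seen[y][x] = True
            top, left, bottom, right = y, x, y, x
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and img[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def group_into_lines(boxes):
    """Greedily group boxes whose vertical extents overlap into text lines.
    Each line is returned as a list of boxes sorted from left to right."""
    lines = []
    for box in sorted(boxes):
        for line in lines:
            if box[0] <= line["bottom"] and box[2] >= line["top"]:
                line["boxes"].append(box)
                line["top"] = min(line["top"], box[0])
                line["bottom"] = max(line["bottom"], box[2])
                break
        else:
            # No existing line overlaps this box vertically: start a new line.
            lines.append({"top": box[0], "bottom": box[2], "boxes": [box]})
    return [sorted(line["boxes"], key=lambda b: b[1]) for line in lines]
```

Grouping lines into text blocks would follow the same pattern one level up, for example by merging lines separated by small vertical gaps.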
The bottom-up approaches are the traditional ones, and they have the advantage that they require no assumptions about the overall structure of the document. On the other hand, bottom-up approaches require iterative segmentation and clustering, which can be time-consuming. Top-down approaches are newer, and have the advantage that they parse the global structure of a document directly, thus eliminating the need to iteratively cluster the possibly hundreds or even thousands of characters and symbols that appear in a document. They tend to be faster, but to operate robustly they typically require a number of assumptions to be made about the layout of the document.
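One well-known instance of the top-down idea is a recursive cut along white gaps, often called a recursive X-Y cut. The sketch below names that specific technique as an illustration and is not taken from the text above. It assumes a binary image as a list of rows with 1 for ink, half-open region coordinates, and a hypothetical `min_gap` threshold for how wide a blank run must be before it counts as a separator. The threshold itself is an example of the layout assumptions that top-down methods rely on.

```python
def _blank_runs(profile, min_gap):
    """Return (start, end) half-open index ranges where the profile is zero
    for at least min_gap consecutive entries."""
    runs, start = [], None
    for i, value in enumerate(profile + [1]):  # sentinel closes a trailing run
        if value == 0 and start is None:
            start = i
        elif value != 0 and start is not None:
            if i - start >= min_gap:
                runs.append((start, i))
            start = None
    return runs


def xy_cut(img, top, left, bottom, right, min_gap=2):
    """Recursively split the region [top, bottom) x [left, right) at its widest
    blank gap. Returns leaf blocks as (top, left, bottom, right), half-open."""
    # Shrink the region to the bounding box of its ink.
    rows = [y for y in range(top, bottom) if any(img[y][left:right])]
    cols = [x for x in range(left, right) if any(img[y][x] for y in range(top, bottom))]
    if not rows or not cols:
        return []  # region contains no ink
    top, bottom, left, right = rows[0], rows[-1] + 1, cols[0], cols[-1] + 1

    # Projection profiles: amount of ink per row and per column.
    row_profile = [sum(img[y][left:right]) for y in range(top, bottom)]
    col_profile = [sum(img[y][x] for y in range(top, bottom)) for x in range(left, right)]

    # Choose the widest blank gap in either direction.
    best = None
    for axis, profile in (("row", row_profile), ("col", col_profile)):
        for start, end in _blank_runs(profile, min_gap):
            if best is None or end - start > best[2] - best[1]:
                best = (axis, start, end)
    if best is None:
        return [(top, left, bottom, right)]  # no separator found: leaf block

    axis, start, end = best
    if axis == "row":
        return (xy_cut(img, top, left, top + start, right, min_gap)
                + xy_cut(img, top + end, left, bottom, right, min_gap))
    return (xy_cut(img, top, left, bottom, left + start, min_gap)
            + xy_cut(img, top, left + end, bottom, right, min_gap))
```

Calling `xy_cut(img, 0, 0, len(img), len(img[0]))` on a two-column page would first separate the columns at the gutter and then split each column at its blank rows.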
There are two issues common to any approach to document layout analysis: noise and skew. Noise refers to image noise, such as salt and pepper noise or Gaussian noise. Skew refers to the fact that a document image may be rotated so that the text lines are not perfectly horizontal. It is a common assumption in both document layout analysis algorithms and optical character recognition algorithms that the characters in the document image are oriented so that text lines are horizontal. Therefore, if skew is present, it is important to rotate the document image to remove it.
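For the salt and pepper noise mentioned above, a median filter is a common remedy. The following is a minimal sketch, assuming a grayscale or binary image stored as a list of rows of numbers. The function name `median_filter3` is hypothetical, and border pixels are simply copied unchanged to keep the example short.

```python
def median_filter3(img):
    """Apply a 3x3 median filter. Border pixels are copied unchanged."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Collect the 3x3 neighbourhood and take its middle value.
            window = sorted(
                img[y + dy][x + dx] for dy in (-1, 0, 1) for dx in (-1, 0, 1)
            )
            out[y][x] = window[4]
    return out
```

An isolated speck is outvoted by its eight neighbours and disappears, while larger strokes survive, although thin strokes one pixel wide can also be eroded by this filter.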
It follows that the first steps in any document layout analysis code are to remove image noise and to come up with an estimate for the skew angle of the document.

Source of excerpt: Wikipedia, the free encyclopedia

Copyright(C) kotoba.ne.jp 1997-2016. All Rights Reserved.